1 Introduction

This project aims to estimate the pose of an object
in an image. Pose estimation is an open and crucial
problem in computer vision. Many real-world tasks
depend heavily on, or can be improved by, good pose
estimation. For example, by knowing the exact pose
of an object, a robot can decide where to sit, how
to grasp, or how to avoid collisions when walking
around. Pose estimation is also applicable to automatic
driving: with good pose estimates of cars, an automatic
driving system can steer itself accordingly. Moreover,
pose estimation can benefit image search and 3D
reconstruction, and has a large potential impact on
many other fields.

Previously, most pose estimation work relied on
manually captured and labeled datasets, acquired with
multiple cameras or depth cameras. However, creating
such a dataset is extremely time-consuming, laborious,
and error-prone. As a result, only limited information
can be learned from those datasets, and research-specific
datasets make previous works hard to compare. In
this project, we instead used the power of 3D
shape models. Specifically, we built a large,
balanced, and precisely labeled training dataset from
ShapeNet [5], a large 3D model pool that contains
millions of 3D shape models in thousands of object
categories. By rendering 3D models into 2D images
from different viewpoints, we can easily control the
size, the pose distribution, and the label precision of
the dataset. A learning model trained on this dataset
helps us better solve the pose estimation task.

In this work, we propose a pose estimation system
trained on rendered images, which predicts the pose
of objects in real images, given the object category
and a tight bounding box. Although the approach is
generic, we chose chairs as our primary research
object. Our system takes a properly cropped chair
image as input and outputs a probability vector over
a discretized pose space. Given a test image, we first
divide it into an N x N grid of overlapping patches.
For each patch position, a multi-class classifier is
trained to estimate the probability that the patch
shows pose v. Then, the scores from all patches are
combined to generate a probability vector for the
whole image.

Although rendering gives us a larger and more precise
training dataset, this approach has an obvious
drawback: the statistical properties of the training
set and the test set are different. For instance, in
the real world there is a prior probability distribution
over poses, which may be non-uniform. Furthermore,
real image features may be more diverse than rendered
image features. In this project, we therefore also
focused on transferring information between 2D images
and 3D models, and proposed a method that iteratively
learns from the classification results and in return
improves the classification algorithm. This approach
corrects for the different prior pose distributions of
the training and test sets. Details and experimental
results are given in the following sections.


1.1 Related Works 

Object pose estimation is a classical problem in 
computer vision. In general, there are two typical 
research lines: one based on 2D representation and 
the other based on 3D information. 


Among 2D-based studies, [6, 7, 8] rely on point
matching, which is now outdated. By linking together
diagnostic parts of an object from different views,
[9] represents an object category as a collection of
view-invariant regions. Sun et al. [10] and Su et al.
[17] used a generative approach to group local features
into parts and then learn part locations across
viewpoints. [11] used a SIFT-like [18] spatial pyramid
of histograms feature to train an SVM classifier for
each discrete pose. Inspired by the Deformable Part
Model [12], [13] trained a DPM using a semi-latent
approach, where the components correspond to discrete
viewpoints. [15] used convolutional neural network
features for the task of pose estimation. [16] proposed
a Hough Forest based method for simultaneous object
detection and continuous pose estimation. Although
these widely different works achieved some success,
they do not learn from the structural information of
objects as humans do.

Recently, 3D model based approaches have achieved
good performance on the pose estimation task. [19]
extended deformable part models to 3D, where
part appearances and spatial deformations are
represented in 3D. Using an off-the-shelf approach,
[20] first obtained a rough localization and viewpoint
of the object, and then estimated a continuous
pose by using annotated 3D CAD models. Hejrati
et al. [21] estimated poses of cars using an explicit
3D shape model and viewpoint learned from
structure-from-motion (SFM). In general, the methods
above rely on sophisticated handling of 3D models,
because only a limited number of models is available.


To our knowledge, no previous work uses a large 3D
model database to solve the pose estimation task.


2 Data Collection and Processing


2.1 Training Data

As mentioned in Section 1, we collected our
training data from ShapeNet, an emerging 3D shape
model database. With 9,135135 semantically annotated
3D models in thousands of object categories,
ShapeNet can provide abundant information for
many vision tasks. In our task, we used the
5057 chair models in ShapeNet. We rendered each
model from 16 viewpoints, evenly distributed on
the horizontal circle, as shown in Figure 1.

Figure 1: Chair models and rendering process


We chose 4000 models, and accordingly 64,000 images,
to build the training dataset, and left the remaining
1057 models as our rendered image test set (validation
set). Before extracting image features, we first resize
each image to 112 x 112 pixels, and then divide
it into a 6 x 6 grid of overlapping patches, with patch
size 32 x 32 and patch stride 16 on both axes. After
preprocessing, we extract a 576-dimensional HoG feature
[2] from each patch, so the whole image is represented
by a 20736-dimensional feature vector. These 64,000
feature vectors constitute our training dataset.


Figure 2: Image preprocessing and feature extraction


2.2 Test Data 

To comprehensively evaluate the performance of
our learning algorithm, we built three test sets of
increasing difficulty: a rendered image test set, a
clean background real image test set, and a cluttered
background real image test set.

The rendered image test set, as mentioned in Section
2.1, consists of 1057 x 16 rendered images, which also
come from ShapeNet. The clean background and cluttered
background real image test sets are collected
from ImageNet [3] and contain 1309 and 1000 images
respectively, both with manually labeled ground-truth
viewpoints. Some sample images are shown
in Figure 3. These three datasets are increasingly
noisy and thus increasingly difficult to tackle.




Figure 3: Clean background & cluttered background 


For image preprocessing and feature extraction on
the test sets, we used the same scheme as for the
training set; that is, each image is converted into a
20736-dimensional HoG feature.
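
The patch layout and feature dimension quoted above can be checked with a short sketch. It only reproduces the arithmetic (112 x 112 image, 32 x 32 patches, stride 16, 576 HoG dimensions per patch); the function names are illustrative and the HoG extraction itself is not shown.

```python
def patch_origins(image_size=112, patch_size=32, stride=16):
    """Top-left coordinates of the overlapping patches along one axis."""
    last = image_size - patch_size
    return list(range(0, last + 1, stride))


def patch_grid(image_size=112, patch_size=32, stride=16):
    """All (row, col) top-left corners of the N x N patch grid."""
    origins = patch_origins(image_size, patch_size, stride)
    return [(r, c) for r in origins for c in origins]


HOG_DIM_PER_PATCH = 576                      # value stated in the text
grid = patch_grid()
n_patches = len(grid)                        # 6 x 6 patches
feature_dim = n_patches * HOG_DIM_PER_PATCH  # dimension of the full feature
print(n_patches, feature_dim)
```

With the stated sizes, each axis holds six patch origins (0, 16, ..., 80), which gives the 6 x 6 grid and the 20736-dimensional image feature.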


3 Model 

Rather than using a global image feature as the
classifier input, our pose estimation model is
patch-based. By dividing the image into patches and
training a classifier for each patch, our model becomes
more robust to occlusion and background noise.
This approach also reduces the feature dimension
of each classifier, and thus the sample complexity.
We did try global features, but their classification
accuracy on the clean background test set is about 28
percentage points lower than that of the patch-based
method, as shown in Table 1. The mathematical
formulation of our patch-based model is as follows.

Define $F_i$ as the HoG feature of patch $i$,
$I = (F_1, \dots, F_{N^2})$ as the HoG feature of
the whole image, and $\mathcal{V} = \{1, \dots, V\}$ as
the discretized pose space. For each patch, we build a
classifier, which learns from training data and gives
a prediction of the conditional probability $P(v \mid F_i)$.
To express $P(v \mid I)$ in terms of $P(v \mid F_i)$,
$i = 1, \dots, N^2$, we assume

$$P(v \mid I) \propto \prod_{i=1}^{N^2} P(v \mid F_i).$$

So we can calculate $P(v \mid I)$ and the corresponding
pose estimate $\hat{v}$ using the following formulas:

$$P(v \mid I) = \frac{\prod_{i=1}^{N^2} P(v \mid F_i)}{\sum_{v'=1}^{V} \prod_{i=1}^{N^2} P(v' \mid F_i)}$$

$$\hat{v} = \arg\max_{v} P(v \mid I)$$

In sum, our model takes $F_i$, $i = 1, \dots, N^2$, as input,
and outputs $P(v \mid I)$ and $\hat{v}$.
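
As a minimal sketch of this combination rule, the product over patches can be evaluated in log space and then normalized. The function name, the `eps` floor that guards against zero probabilities, and the toy numbers are assumptions for illustration; poses are indexed from 0 here rather than from 1.

```python
import math


def combine_patches(patch_probs, eps=1e-12):
    """Combine per-patch posteriors P(v|F_i) into the image posterior P(v|I).

    patch_probs: list of N^2 lists, each holding one probability per pose.
    eps: floor that guards log(0); an illustrative assumption.
    """
    n_poses = len(patch_probs[0])
    # The sum of logs equals the log of the product over patches.
    log_scores = [
        sum(math.log(max(p[v], eps)) for p in patch_probs)
        for v in range(n_poses)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    unnorm = [math.exp(s - top) for s in log_scores]
    total = sum(unnorm)
    posterior = [u / total for u in unnorm]
    best_pose = max(range(n_poses), key=lambda v: posterior[v])
    return posterior, best_pose


# Toy example: three patches, three poses.
patches = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]
posterior, best_pose = combine_patches(patches)
print(posterior, best_pose)
```

Working in log space avoids underflow when 36 small probabilities are multiplied, while the normalized result is the same as for the direct product.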

4 Methods 

4.1 Learning Algorithms 

4.1.1 Random Forest 

In this project, we chose random forest [1] as our
primary classification algorithm because of the
following advantages:


• Suitable for multiclass classification.

• Non-parametric, easy to tune.

• Fast, easy to parallelize.

• Robust, due to randomized processing.

We train 36 random forest classifiers, one for each
of the 36 patches. As a trade-off between time and
space complexity and performance, we set the forest
size to 100 trees. We also tuned the maximum tree
depth using cross-validation, which gave an optimal
depth of 20. Each random forest outputs a probability
vector $P(v \mid F_i)$. After Laplace smoothing, we
calculate $P(v \mid I)$ and estimate the pose as
$\hat{v} = \arg\max_{v} P(v \mid I)$.

4.2 Optimization

Constructing the training dataset from rendered images
has many advantages, but there are also drawbacks.
As mentioned in Section 1, the prior probability of
poses in real images can differ greatly from that
in rendered images. The pose distribution in the
training set is uniform, whereas real images contain
far more front-view chairs than back-view chairs.
Fortunately, this difference can be analyzed and
modeled as follows.


4.2.1 Probability Calibration


In the classification step, each classifier $C_i$ outputs
a probability vector $P(v \mid F_i)$. By Bayes' formula,
we have:

$$P(v \mid F_i) = \frac{P(v)\, P(F_i \mid v)}{P(F_i)}$$

Although we may not learn $P(v)$, $P(F_i \mid v)$, and
$P(F_i)$ explicitly during training, we can use them to
describe the statistical properties of the training
data. The real posterior $\tilde{P}(v \mid F_i)$, which
satisfies the following formula, can differ from
$P(v \mid F_i)$. Here, $\tilde{P}(v)$, $\tilde{P}(F_i \mid v)$,
and $\tilde{P}(F_i)$ are statistical properties of the
test set.

$$\tilde{P}(v \mid F_i) = \frac{\tilde{P}(v)\, \tilde{P}(F_i \mid v)}{\tilde{P}(F_i)}$$

Assume the training data and the test data have at
least some similarity. Specifically, assume
$\tilde{P}(F_i \mid v) = P(F_i \mid v)$ and
$\tilde{P}(F_i) = P(F_i)$; then we have:

$$\tilde{P}(v \mid F_i) = P(v \mid F_i)\, \frac{\tilde{P}(v)}{P(v)} \propto P(v \mid F_i)\, \tilde{P}(v)$$

The proportionality holds because the training prior
$P(v)$ is uniform. To recover $\tilde{P}(v \mid F_i)$, we
just need a good estimate of $\tilde{P}(v)$. One possible
method is to randomly choose some samples from the
test set, manually label their ground-truth viewpoints,
and regard the ground-truth pose distribution of these
samples as an estimate of the overall $\tilde{P}(v)$.
However, this still requires some manual labor, namely
labeling.
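
The labeling-based variant can be sketched as follows, assuming a uniform training prior so that dividing by it only changes the normalizing constant. The function names, the add-one smoothing of the label counts, and the toy labels are assumptions for illustration, not part of the original method.

```python
from collections import Counter


def estimate_prior(labels, n_poses, smoothing=1.0):
    """Estimate the test-set pose prior from a small labeled sample.

    Add-one smoothing keeps unseen poses at a non-zero probability;
    the smoothing constant is an illustrative assumption.
    """
    counts = Counter(labels)
    total = len(labels) + smoothing * n_poses
    return [(counts.get(v, 0) + smoothing) / total for v in range(n_poses)]


def recalibrate(posterior, test_prior):
    """Reweight a classifier posterior by the test prior and renormalize."""
    weighted = [p * q for p, q in zip(posterior, test_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical labeled sample in which pose 0 dominates.
labels = [0, 0, 0, 0, 0, 0, 1, 2]
prior = estimate_prior(labels, n_poses=4)
calibrated = recalibrate([0.30, 0.40, 0.20, 0.10], prior)
print(prior, calibrated)
```

In this toy case the classifier output favors pose 1, but after reweighting by the estimated prior the most probable pose becomes pose 0, which is the kind of correction the calibration is meant to achieve.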


Noting that the above formula can also be written as:

$$\frac{\tilde{P}(v \mid F_i)}{\tilde{P}(v)} = \frac{P(v \mid F_i)}{P(v)}, \qquad P(v) = k, \;\; \forall v \in \mathcal{V},$$

where $P$ denotes training-set quantities, $\tilde{P}$ the
corresponding test-set quantities, and $k$ the constant
value of the uniform training prior, we came up with
another idea to automatically improve the
classification result. For $P(v \mid F_i)$, we have:

$$\tilde{P}(v) > k \;\Rightarrow\; P(v \mid F_i) < \tilde{P}(v \mid F_i)$$

$$\tilde{P}(v) < k \;\Rightarrow\; P(v \mid F_i) > \tilde{P}(v \mid F_i)$$

That means, at test time, frequently appearing
poses are underestimated, while uncommon poses
are overestimated. Here, we propose an iterative
method to counterbalance this effect. Basically, we
use $P(v \mid F_i)$ to generate an estimate $\hat{P}(v)$ of
the prior distribution; assume that $\hat{P}(v)$ and
$\tilde{P}(v)$ have similar common and uncommon views (in
other words, that they have the same trend); smooth
$\hat{P}(v)$ to keep the trend while reducing its
fluctuation range; multiply the original $P(v \mid F_i)$ by
the smoothed $\hat{P}(v)$; and iteratively repeat the above
steps. Finally, due to the damping effect in the
combination step, $\hat{P}(v)$ converges, and
$\hat{P}(v \mid F_i)$ gets closer to $\tilde{P}(v \mid F_i)$.
This iterative algorithm is formulated as follows:

1. Calculate $\hat{P}(v \mid I^{(j)})$, $j = 1, \dots, m$
(initially, $\hat{P}(v \mid F_i^{(j)}) = P(v \mid F_i^{(j)})$):

$$\hat{P}(v \mid I^{(j)}) = \frac{\prod_{i=1}^{N^2} \hat{P}(v \mid F_i^{(j)})}{\sum_{v'=1}^{V} \prod_{i=1}^{N^2} \hat{P}(v' \mid F_i^{(j)})}$$

2. Accumulate $\hat{P}(v \mid I^{(j)})$ over all test samples
to calculate $\hat{P}(v)$:

$$\hat{P}(v) = \frac{1}{m} \sum_{j=1}^{m} \hat{P}(v \mid I^{(j)})$$

3. Smooth $\hat{P}(v)$ by factor $\alpha$:

$$\hat{P}_s(v) = \frac{\hat{P}(v) + \alpha}{1 + 16\alpha}$$

4. Estimate $\tilde{P}(v \mid F_i)$ by letting:

$$\hat{P}(v \mid F_i) = P(v \mid F_i)\, \hat{P}_s(v)$$

5. Use $\hat{P}(v \mid F_i)$ to re-calculate
$\hat{P}(v \mid I^{(j)})$ in step 1, while keeping
$P(v \mid F_i)$ in step 4 unchanged, and repeat
the above steps.


4.2.2 Parameter Automatic Selection

After several iterations, the algorithm converges,
and we get a final estimate $\hat{P}(v \mid F_i)$ of
$\tilde{P}(v \mid F_i)$. However, different values of
$\alpha$ lead to very different converged results, as
shown in Figure 4. From the experimental results in
Figure 5, we observed that if $\alpha$ is too small, the
viewpoint with the highest initial probability
$\hat{P}(v)$ soon beats the other viewpoints, and
$\hat{P}(v)$ converges to a totally biased distribution.
If $\alpha$ is too large, the smoothing is so strong that
the prior no longer influences $P(v \mid F_i)$, resulting
in $\hat{P}(v \mid F_i) = P(v \mid F_i)$. However,
there exists an intermediate value of $\alpha$ that
maximizes the classification accuracy and leads to an
optimal estimate $\hat{P}(v \mid F_i)$. In Figures 4 and 5,
$\alpha_{opt}$ is 0.8.


Figure 4: Classification accuracy change w.r.t. $\alpha$


Figure 5: Stable distribution $\hat{P}(v)$ w.r.t. $\alpha$
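
The five-step iteration of Section 4.2.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions: a fixed iteration count stands in for a convergence test, the `eps` floor guards against zero probabilities, the smoothing denominator is written for a general number of poses (it equals 1 + 16α for 16 viewpoints), and all names and toy numbers are hypothetical.

```python
import math


def image_posterior(patch_probs, weights, eps=1e-12):
    """Step 1: normalized product over patches of P(v|F_i) * weights[v]."""
    n = len(weights)
    logs = [
        sum(math.log(max(p[v] * weights[v], eps)) for p in patch_probs)
        for v in range(n)
    ]
    top = max(logs)
    unnorm = [math.exp(s - top) for s in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def iterate_prior(test_set, alpha, n_iter=50):
    """Run steps 1 to 5 and return the final prior estimate.

    test_set: list of images, each a list of per-patch posteriors P(v|F_i).
    alpha: smoothing factor of step 3.
    n_iter: fixed iteration count, an assumption replacing a convergence test.
    """
    n = len(test_set[0][0])
    weights = [1.0] * n        # the first pass uses the raw classifier output
    prior = [1.0 / n] * n
    for _ in range(n_iter):
        # Step 1: image posteriors from the reweighted patch posteriors.
        posts = [image_posterior(img, weights) for img in test_set]
        # Step 2: average over the m test images.
        prior = [sum(p[v] for p in posts) / len(posts) for v in range(n)]
        # Step 3: smooth the estimated prior.
        smoothed = [(q + alpha) / (1.0 + n * alpha) for q in prior]
        # Steps 4 and 5: always reweight the original P(v|F_i).
        weights = smoothed
    return prior


# Toy test set: three images, two patches each, two poses.
toy = [
    [[0.7, 0.3], [0.6, 0.4]],
    [[0.6, 0.4], [0.8, 0.2]],
    [[0.4, 0.6], [0.45, 0.55]],
]
print(iterate_prior(toy, alpha=0.01), iterate_prior(toy, alpha=1000.0))
```

On this toy data, a very small α drives the estimate toward the initially dominant pose, while a very large α leaves it close to the average of the uncalibrated image posteriors, which matches the two failure modes described for α.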


To determine the optimal $\alpha$, we analyzed in depth
the relationship between the stable $\hat{P}(v)$ and
$\alpha$. We found three patterns of relationship between
$\hat{P}(v_j)$ and $\alpha$, shown in Figure 6. For some
viewpoints, $\hat{P}(v_j)$ increases almost monotonically
with $\alpha$ (the blue curves); for some it decreases
monotonically (the black curve); and for others it
first increases and then decreases (the red curves).
Recalling how the distribution changes with $\alpha$ in
Figure 5, we found that $\hat{P}(v)$ first approximates
$\tilde{P}(v)$ and is then smoothed out. Therefore, the
patterns with turning points are a good reflection of
this trend. Summing over those components, we obtain
Figure 7, and take the turning point of the curve as
our estimated $\alpha$. Here the estimated $\alpha$ is 1,
very close to the optimal value $\alpha_{opt} = 0.8$.
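
One way to sketch this selection rule is shown below. It assumes that a "turning point" means the curve's maximum lies strictly inside the tested range of the smoothing factor, and that the converged prior estimates have already been computed on a grid of candidate values; the function names and the toy numbers are hypothetical.

```python
def has_turning_point(curve):
    """True if the curve peaks strictly inside its range (rises, then falls)."""
    peak = max(range(len(curve)), key=lambda k: curve[k])
    return 0 < peak < len(curve) - 1


def select_alpha(alphas, stable_priors):
    """Pick the alpha at the peak of the summed turning-point curves.

    alphas: increasing grid of candidate smoothing factors.
    stable_priors: stable_priors[k][v] is the converged prior estimate
                   of viewpoint v at alphas[k].
    """
    n_poses = len(stable_priors[0])
    curves = [[row[v] for row in stable_priors] for v in range(n_poses)]
    selected = [c for c in curves if has_turning_point(c)]
    if not selected:
        return None  # no rise-then-fall pattern on this grid
    summed = [sum(c[k] for c in selected) for k in range(len(alphas))]
    peak = max(range(len(alphas)), key=lambda k: summed[k])
    return alphas[peak]


# Toy grid: viewpoint 0 decreases, viewpoint 1 rises then falls,
# viewpoint 2 increases.
alphas = [0.1, 0.5, 1.0, 2.0, 5.0]
stable = [
    [0.80, 0.15, 0.05],
    [0.60, 0.28, 0.12],
    [0.45, 0.35, 0.20],
    [0.42, 0.33, 0.25],
    [0.40, 0.32, 0.28],
]
print(select_alpha(alphas, stable))
```

Only the rise-then-fall curve contributes to the sum here, so the selected value is the grid point where that curve peaks.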



Figure 6: $\hat{P}(v_j)$ curve with respect to $\alpha$



Figure 7: Estimated $\alpha$

5 Results 

5.1 Classification Performance 

In Table 1, our patch-based random forest
classification algorithm (denoted as RF) shows
promising classification results on all three test sets.


              Render    Clean    Cluttered
RF (%)        96.16     80.67    76.80
RF_opt (%)    -         88.90    78.70
RF_GT (%)     -         91.29    81.00
Global (%)    97.03     52.64    10.90


Table 1: Classification accuracy on three test sets

Under our scheme, random forest achieves about 80%
accuracy on the clean background real image test set,
and 77% on the cluttered background test set. After
calibrating the conditional probability $P(v \mid F_i)$
using the automatically selected $\alpha$ (denoted as
RF_opt), performance on the clean test set is boosted
by 8 percentage points, and by 2 points on the
cluttered set. The relatively low improvement on the
cluttered test set may be because our assumptions
$\tilde{P}(F_i \mid v) = P(F_i \mid v)$ and
$\tilde{P}(F_i) = P(F_i)$ are too strong
for cluttered images.

The row “RF_GT” shows the result of calibrating
$P(v \mid F_i)$ using the ground-truth prior $\tilde{P}(v)$.
Its accuracy is only about 2 percentage points higher
than that of our optimization approach, indicating the
effectiveness of our method.
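
The gains quoted in the text can be recomputed directly from the Table 1 entries. The dictionary below simply copies the real-image columns of the table; the helper name is illustrative.

```python
# Accuracy (%) on the two real-image test sets, copied from Table 1.
accuracy = {
    "RF":     {"clean": 80.67, "cluttered": 76.80},
    "RF_opt": {"clean": 88.90, "cluttered": 78.70},
    "RF_GT":  {"clean": 91.29, "cluttered": 81.00},
}


def gain(row_a, row_b, test_set):
    """Accuracy difference row_b - row_a, in percentage points."""
    return round(accuracy[row_b][test_set] - accuracy[row_a][test_set], 2)


for test_set in ("clean", "cluttered"):
    print(
        test_set,
        "calibration gain:", gain("RF", "RF_opt", test_set),
        "gap to ground-truth prior:", gain("RF_opt", "RF_GT", test_set),
    )
```

These differences are in percentage points, which is how the 8% and 2% figures in the text should be read.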

Besides, the “Global” row shows that classification
on global image features performs poorly. Although
it achieves the best result on rendered images,
performance drops significantly when testing on real
images. One possible explanation is that the global
classifier overfits to the patch importance learned
from the training set, while patch importance
in real images is different. Figure 8 supports this
hypothesis. In contrast, the patch-based method gives
equal importance to all patches, and hence reduces
overfitting.


Figure 8: Patch importance on rendered, clean, and
cluttered test sets, learned by training a global
random forest classifier on each of the three datasets.

Figure 9 shows the confusion matrices on the three
test sets. From left to right, as test difficulty
increases, the confusion matrix becomes increasingly
scattered. On the rendered image test set, an
interesting phenomenon is that some poses are often
misclassified as poses that differ from them by 90°;
one possible explanation is that some chairs have a
roughly square shape. Also, front and back views are
often confused, because they have similar appearance
in feature space.



Figure 9: Confusion matrix on rendered, clean, and 
cluttered test sets 
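
The 90° and 180° confusions can be quantified from a confusion matrix by counting errors per circular viewpoint offset. The sketch below assumes 16 evenly spaced viewpoints (22.5° per bin, so 4 bins correspond to 90° and 8 bins to 180°); the function names and the toy labels are hypothetical.

```python
def confusion_matrix(true_poses, predicted_poses, n_poses=16):
    """Rows index the true pose, columns the predicted pose."""
    matrix = [[0] * n_poses for _ in range(n_poses)]
    for t, p in zip(true_poses, predicted_poses):
        matrix[t][p] += 1
    return matrix


def error_offsets(matrix):
    """Count errors by circular offset (in bins) between true and predicted pose."""
    n = len(matrix)
    counts = {}
    for t in range(n):
        for p in range(n):
            if t != p and matrix[t][p]:
                d = min((p - t) % n, (t - p) % n)
                counts[d] = counts.get(d, 0) + matrix[t][p]
    return counts


BIN_DEGREES = 360 / 16                 # 22.5 degrees per viewpoint bin
true_poses = [0, 0, 4, 8, 8, 12]       # hypothetical ground truth
predicted = [0, 4, 4, 0, 8, 8]         # hypothetical predictions
offsets = error_offsets(confusion_matrix(true_poses, predicted))
print({d * BIN_DEGREES: c for d, c in offsets.items()})
```

A large count at offset 4 would correspond to the 90° confusions described above, and a large count at offset 8 to front and back views being mistaken for each other.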


6 Conclusion 

In this paper, we proposed a novel pose estimation
approach that learns from 3D models. We explained
our model in a Bayesian framework, and introduced a
new optimization method to transmit information from
2D images to 3D models. The promising experimental
results verify the effectiveness of our scheme.

7 Future Work 

We have several ideas for future work:

• Take the foreground and background information in
the image into account, to fully use the information
in rendered images.

• Further model the differences among the three
datasets, and revise our inaccurate assumptions.

• Learn the discriminativeness of patches, and give
different weights to different patches.

• Generalize our algorithm to occluded images and to
other object categories.